By François Michonneau, @fmic_
Reproducibility is one of the cornerstones of the scientific process. Your readers should be able to tell where the numbers in your manuscript (e.g., statistics and p-values) and the data points in your plots come from. As analytical methods grow more sophisticated and data sets become larger, reproducibility is becoming both more important and easier to achieve.
It is becoming more important because basic data manipulations that, until recently, could be explained in words are now too intricate to describe in detail when analyses are conducted on large datasets that require programmatic modification of their content. The modifications made to the data are often easier to share by providing readers with the scripts used to generate the datasets, statistics, or figures included in the manuscript.
It is also becoming easier because the programming languages commonly used in science provide effective ways to integrate the text of the manuscript with the code used to generate the results. This approach, known as literate programming, is exemplified in Python by IPython notebooks, and in R by R Markdown and knitr. Although these technologies facilitate literate programming, they are typically not taught as part of the traditional university curriculum, even though they have the potential to accelerate science while making results more robust and more transparent.
To me, the biggest advantage of making my science reproducible and using literate programming is that it saves me a lot of time by automating my workflow. I do not have to worry about remaking figures manually or re-creating intermediate data sets when new data come in or when errors are detected in the raw data. All of these outputs can easily be regenerated by running scripts.
On June 1st and 2nd, 21 people (including 5 remote participants) participated in the second Reproducible Science Workshop organized at iDigBio. The instructors were Hilmar Lapp (Duke Genome Center), Ciera Martinez (UC Davis), and myself (iDigBio). The helpers (Judit Ungvari-Martin, Deb Paul, Kevin Love) ensured that the workshop was running smoothly and assisted participants as needed.
Before the workshop, we asked participants to fill out a survey to get a sense of our audience and to assess what they were expecting from the workshop.
Most of the participants were graduate students in the life sciences who program every day (though some programmed rarely), and almost all of them use R regularly (17 of the 22 respondents; see the note at the end of this post).
We also asked participants how they currently record their data and whether they feel confident that one of their colleagues could reproduce their results and figures given the data and their notes. Most people reported using a lab notebook or online documents. A majority reported being “confident” or “somewhat confident” that their documentation was sufficient for a colleague to reproduce their results, but nearly half reported being “not very confident”, and nobody chose the “very confident” option.
Finally, we asked participants how often they share their code, data, and analyses, and why they do it. Interestingly, “opinion of colleagues” seems to be one of the main drivers of sharing, which suggests that, among these early adopters, peer pressure is perceived as more important than requirements from journals and funding agencies.
We kicked off the workshop by asking participants which tools and methods they currently use to document their analyses. Even though we ask a similar question in the pre-workshop survey, we have found that this exercise is a great way to break the ice and get the conversation started.
Ciera taught a module that highlights the common challenges of working in a non-reproducible context. We asked participants to generate simple plots from the Gapminder dataset and to write instructions on how to reproduce them. They then gave the instructions to their neighbors, who tried to recreate the plots. For this exercise, most participants resorted to Excel, and they realized how challenging it can be to write instructions detailed enough for someone else to repeat a plot. After this exercise, they were introduced to literate programming in R with knitr and R Markdown. Participants who were discovering this for the first time let out a few “WOW”s.
In the afternoon of the second day, Hilmar introduced participants to best practices for naming and organizing files to facilitate reproducibility, while working with a slightly more realistic knitr document. Participants were also introduced to the benefits of modifying data programmatically. Overall, this module is really well received because it provides participants with many tips and best practices for organizing files in research projects.
During the first workshop at Duke, participants repeatedly asked to learn more about version control, especially Git and GitHub. Karen Cranston improvised a quick demonstration, but for the second workshop Ciera put together a module (adapted from the Software Carpentry lesson) using the GitHub GUI tool. Participants clearly understood the benefit of version control, and starting with the GUI made it accessible to most of them. However, because of the variety of operating systems (and versions), it was difficult to provide instructions that worked for everyone, especially given that some participants had operating systems that did not support the GUI.
For the rest of the second day, François covered how to organize your code into functions within your knitr document to automate the generation of the intermediate datasets, figures and manuscript. While most participants had previous experience with R, few knew how to write functions. This was a good time to introduce them to this (sometimes) underrated approach to organizing code.
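The workshop taught this in R with knitr. As a rough illustration of the same idea, here is a minimal sketch in Python, in which every step of the analysis is a small function and a single call regenerates the outputs from the raw data. The column names (`species`, `length_mm`), the inline data, and the function names are all hypothetical and are not taken from the workshop material.

```python
import csv
import io
from statistics import mean

# Hypothetical raw data, standing in for a raw CSV file kept unmodified on disk.
RAW_CSV = """species,length_mm
a,10.0
a,12.0
b,20.0
b,
"""


def read_raw(text):
    """Parse the raw CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def clean(rows):
    """Build the intermediate dataset: drop rows with a missing length
    and convert the remaining lengths to floats."""
    return [
        {"species": row["species"], "length_mm": float(row["length_mm"])}
        for row in rows
        if row["length_mm"].strip()
    ]


def summarize(rows):
    """Compute the mean length for each species."""
    by_species = {}
    for row in rows:
        by_species.setdefault(row["species"], []).append(row["length_mm"])
    return {species: mean(values) for species, values in by_species.items()}


def run_analysis(text):
    """Run every step in order, so all outputs come from the raw data."""
    return summarize(clean(read_raw(text)))


summary = run_analysis(RAW_CSV)
print(summary)
```

If the raw data change or an error is corrected, calling `run_analysis` again rebuilds the intermediate dataset and the summary without any manual step.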
To finish the two days, Hilmar covered the different licenses and publishing platforms that allow researchers to make their manuscripts, code and data publicly available. Most participants had heard of Creative Commons licenses or Dryad, but few knew enough to navigate this growing ecosystem.
Overall, the workshop was well received by participants. In the post-workshop survey, 87% indicated that their ability to conduct reproducible science was higher or much higher than before the workshop.
Note: Some participants cancelled at the last minute, and others who attended the workshop didn’t fill out the survey. Therefore, the number of survey respondents does not correspond exactly to the number of participants.